NLP training example

In this example, we'll train an NLP model for sentiment analysis of tweets using spaCy.

First, we download spaCy's pretrained English model.


In [1]:
!python -m spacy download en_core_web_sm


Requirement already satisfied: en_core_web_sm==2.2.5 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz#egg=en_core_web_sm==2.2.5 in /Users/miliu/Documents/modeldb/demos/webinar-2020-5-6/02-mdb_versioned/venv/lib/python3.7/site-packages (2.2.5)
...
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

Then we import the libraries we'll need.


In [2]:
from __future__ import unicode_literals, print_function

import boto3
import json
import numpy as np
import pandas as pd
import spacy

Data prep

Download the dataset from S3.


In [3]:
S3_BUCKET = "verta-strata"
S3_KEY = "english-tweets.csv"
FILENAME = S3_KEY

boto3.client('s3').download_file(S3_BUCKET, S3_KEY, FILENAME)
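After downloading, it can be worth a quick sanity check that the file landed intact before feeding it to pandas. This helper is our own illustration (not part of the tutorial's code) and uses only the standard library:

```python
import csv
import os

def peek_csv(path, n=3):
    """Return file size plus the header and first n rows, as a quick sanity check."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(n), reader)]
    return os.path.getsize(path), header, rows
```

For the tweets dataset we'd expect a `text` and a `sentiment` column in the header, and a nonzero file size.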

Load the data, shuffle it, and clean it using our utility library.


In [4]:
import utils

data = pd.read_csv(FILENAME).sample(frac=1).reset_index(drop=True)
utils.clean_data(data)

data.head()


Out[4]:
text sentiment
0 no, it's just bleurgh 1
1 YAY awesome news. I love Gavin and Stacey! Als... 0
2 so ready for the states!! cant wait to take off.. 1
3 lazy bum! hehe! Where do you work? 1
4 I would like to say good morning tweets!! If y... 0
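The implementation of `utils.clean_data` isn't shown in this notebook. As a rough idea of the kind of normalization such a helper might apply to raw tweets, here is a hypothetical sketch (our own, not the tutorial's actual code):

```python
import re

def clean_tweet(text):
    """Illustrative tweet normalization: strip URLs, mentions, hashtags, extra whitespace."""
    text = re.sub(r"https?://\S+", "", text)  # drop links
    text = re.sub(r"[@#]\w+", "", text)       # drop @mentions and #hashtags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_tweet("YAY awesome news @gavin http://t.co/abc #stacey"))
# → YAY awesome news
```

The real `clean_data` presumably operates on the whole DataFrame in place; this sketch just shows the per-tweet idea.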

Train the model

We'll use a pre-trained model from spaCy and fine-tune it on our dataset.


In [5]:
nlp = spacy.load('en_core_web_sm')

Fine-tune the model on the current data using our training library.


In [6]:
import training

training.train(nlp, data, n_iter=20)


Using 16000 examples (12800 training, 3200 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
15.932	0.754	0.718	0.736
0.367	0.746	0.750	0.748
0.110	0.755	0.744	0.749
0.096	0.761	0.741	0.751
0.085	0.759	0.740	0.749
0.073	0.758	0.733	0.745
0.062	0.748	0.731	0.740
0.052	0.743	0.722	0.733
0.046	0.748	0.725	0.736
0.039	0.751	0.718	0.734
0.032	0.744	0.720	0.732
0.031	0.743	0.719	0.731
0.026	0.740	0.719	0.730
0.023	0.738	0.718	0.728
0.022	0.729	0.713	0.721
0.019	0.728	0.716	0.722
0.019	0.734	0.717	0.726
0.018	0.734	0.718	0.726
0.017	0.733	0.718	0.726
0.016	0.732	0.717	0.724
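In the table above, the columns are the training loss followed by precision (P), recall (R), and F-score (F) on the evaluation split. The F-score is the harmonic mean of precision and recall, which we can verify against the first row:

```python
def f_score(p, r):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r)

# First row of the table above: P=0.754, R=0.718
print(round(f_score(0.754, 0.718), 3))  # 0.736
```

Note that while the loss keeps dropping, the evaluation F-score peaks around iteration 4 and then declines slightly — a hint of overfitting on later iterations.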

Now we serialize the model and upload it to a well-known S3 location (make sure it's one you can write to!) so that we can fetch it later.


In [7]:
filename = "/tmp/model.spacy"
with open(filename, 'wb') as f:
    f.write(nlp.to_bytes())

In [8]:
boto3.client('s3').upload_file(filename, S3_BUCKET, "models/01/model.spacy")

In [9]:
filename = "/tmp/model_metadata.json"
with open(filename, 'w') as f:
    f.write(json.dumps(nlp.meta))

In [10]:
boto3.client('s3').upload_file(filename, S3_BUCKET, "models/01/model_metadata.json")

Deployment

Great! You now have a model you can run predictions with. Follow the next step of this tutorial to see how.